Abstract. The number of triangles is a computationally expensive graph statistic which is frequently used in complex
network analysis (e.g., transitivity ratio), in various random graph models (e.g., exponential random graph model) and in
important real world applications such as spam detection, uncovering of the hidden thematic structure of the Web and link
recommendation. Counting triangles in graphs with millions and billions of edges requires algorithms which run fast, use
a small amount of space, provide accurate estimates of the number of triangles and preferably are parallelizable.
In this paper we present an efficient triangle counting algorithm which can be adapted to the semistreaming model [15].
The key idea of our algorithm is to combine the sampling algorithm of [34,35] and the partitioning of the set of vertices
into a high degree and a low degree subset respectively as in [2], treating each set appropriately. We obtain a running time
$O\left(m + \frac{m^{3/2} \Delta \log n}{t \epsilon^2}\right)$ and an $\epsilon$-approximation (multiplicative error), where $n$ is the number of vertices,
$m$ the number of edges, $t$ the number of triangles and $\Delta$ the maximum number of triangles in which an edge is contained.
Furthermore, we show how this algorithm can be adapted to the semistreaming model with space usage
$O\left(m^{1/2} \log n + \frac{m^{3/2} \Delta \log n}{t \epsilon^2}\right)$ and a constant number of passes (three) over
the graph stream. We apply our methods in various networks with several millions of edges and we obtain excellent results.
Finally, we propose a random projection based method for triangle counting and provide a sufficient condition to obtain an
estimate with low variance.
1 Introduction

Graphs are ubiquitous: the Internet, the World Wide Web (WWW), social networks, protein interaction
networks and many other complicated structures are modeled as graphs [ ]. The problem of counting
subgraphs is one of the typical graph mining tasks that has attracted a lot of attention. The most basic,
non-trivial subgraph is the triangle. Given a simple, undirected graph $G(V, E)$, a triangle is a three node
fully connected subgraph. Many social networks are abundant in triangles, since typically friends of friends
tend to become friends themselves [ ]. This phenomenon is observed in other types of networks as well
(biological, online networks etc.) and is one of the main reasons which gave rise to the definitions of the
transitivity ratio and the clustering coefficients of a graph in complex network analysis [27]. Triangles are
used in several applications such as uncovering the hidden thematic structure of the web [13], as a feature
to assist the classification of web activity [ ] and for link recommendation in online social networks [36].
Furthermore, triangles are used as a network statistic in the exponential random graph model [ ].

In this paper, we propose a new triangle counting method which provides an $\epsilon$-approximation to the
number of triangles in the graph and runs in $O\left(m + \frac{m^{3/2} \Delta \log n}{t \epsilon^2}\right)$ time, where $n$ is the number of vertices,
$m$ the number of edges, $t$ the number of triangles and $\Delta$ the maximum number of triangles in which an edge is
contained. The key idea of the method is to combine the sampling scheme introduced by Tsourakakis et al. in [34,35]
with the partitioning idea of Alon, Yuster and Zwick [2] in order to obtain a more efficient sampling scheme.
Furthermore, we show that this method can be adapted to the semistreaming model with a constant number of passes and
$O\left(m^{1/2} \log n + \frac{m^{3/2} \Delta \log n}{t \epsilon^2}\right)$ space. We apply our methods in various networks with several millions of
edges and we obtain excellent results both with respect to the accuracy and the running time. Furthermore,
we optimize the cache properties of the code in order to obtain a significant additional speedup. Finally,
we propose a random projection based method for triangle counting and provide a sufficient condition to
obtain an estimate with low variance.

The paper is organized as follows: Section 2 presents briefly the existing work and the theoretical 
background, Section 3 presents our proposed method and Section 4 presents the experimental results on 
several large graphs. In Section 5 we provide a sufficient condition for obtaining a concentrated estimate of 
the number of triangles using random projections and in Section 6 we conclude and provide new research 
directions. 

2 Preliminaries 

In this section, we briefly present the existing work on the triangle counting problem and the necessary
theoretical background for our analysis, namely a version of the Chernoff bound and the
Johnson-Lindenstrauss lemma. Table 1 lists the symbols used in this paper.

2.1 Existing work 

There exist two categories of triangle counting algorithms, the exact and the approximate. It is worth
noting that for the applications described in Section 1 the exact number of triangles is not crucial. Thus,
approximate counting algorithms which are faster and output a high quality estimate are desirable for the
practical applications in which we are interested in this work.

The state of the art algorithm is due to Alon, Yuster and Zwick [2] and runs in $O(m^{2\omega/(\omega+1)})$ time, where
currently the fast matrix multiplication exponent $\omega$ is 2.371 [ ]. Thus, the Alon et al. algorithm currently
runs in $O(m^{1.41})$ time. Algorithms based on matrix multiplication are not used in practice due to the high
memory requirements. Even for medium sized networks, matrix-multiplication based algorithms are not
applicable. In planar graphs, triangles can be found in $O(n)$ time [17,28]. Furthermore, in [17] an algorithm
which finds a triangle in any graph in $O(m^{3/2})$ time is proposed. This algorithm can be extended to list the
triangles in the graph with the same time complexity. Even if listing algorithms solve a more general
problem than the counting one, they are preferred in practice for large graphs, due to the smaller memory
requirements compared to the matrix multiplication based algorithms. Simple representative algorithms
are the node- and the edge-iterator algorithms. The former counts for each node the number of triangles it is
involved in, which is equivalent to the number of edges among its neighbors, whereas in the latter, the
algorithm counts for each edge $(i, j)$ the common neighbors of nodes $i, j$. Both of these algorithms have
the same asymptotic complexity $O(mn)$, which in dense graphs results in $O(n^3)$ time, the complexity of
the naive counting algorithm. Practical improvements over this family of algorithms have been achieved
using various techniques, such as hashing and sorting by the degree [24,30].
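
As an illustration of the edge-iterator idea described above, the following is a minimal sketch in Python. It is not the implementation evaluated in this paper; the function name `count_triangles_edge_iterator` and the assumption that vertex labels are mutually comparable (e.g., integers) are ours.

```python
from collections import defaultdict


def count_triangles_edge_iterator(edges):
    """Exact triangle count of a simple undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self loops
            adj[u].add(v)
            adj[v].add(u)
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:                   # visit each undirected edge once
                # common neighbors of the endpoints close a triangle on (u, v)
                total += len(adj[u] & adj[v])
    return total // 3                   # each triangle is found from its 3 edges
```

Each triangle is discovered once per edge, hence the final division by three. The set intersection is where hashing or degree ordering would be applied in a tuned implementation.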

Symbol          Definition
G([n], E)       undirected simple graph with n vertices labeled 1, 2, ..., n and edge set E
m               number of edges in G
t               number of triangles in G
deg(u)          degree of vertex u
Δ(u, v)         number of triangles containing vertices u and v
Δ               max_{e ∈ E(G)} Δ(e)
p               sparsification parameter

Table 1. Table of symbols

On the approximate counting side, most of the triangle counting algorithms have been developed in
the streaming setting. In this scenario, the graph is represented as a stream. Two main representations of
a graph as a stream are the edge stream and the incidence stream. In the former, edges arrive one at
a time. In the latter scenario all edges incident to the same vertex appear successively in the stream. The
ordering of the vertices is assumed to be arbitrary. A streaming algorithm produces a relative $\epsilon$-approximation
of the number of triangles with high probability, making a constant number of passes over the stream.
However, sampling algorithms developed in the streaming literature can be applied in the setting where
the graph fits in the memory as well. Monte Carlo sampling techniques have been proposed to give a fast
estimate of the number of triangles. According to such an approach, a.k.a. naive sampling [3 ], we choose
three nodes at random repetitively and check if they form a triangle or not. If one makes

$$r = \log\left(\frac{1}{\delta}\right)\frac{1}{\epsilon^2}\left(1 + \frac{T_0 + T_1 + T_2}{T_3}\right)$$

independent trials, where $T_i$ is the number of triples with $i$ edges, and outputs as the estimate of triangles
the random variable $T_3'$ equal to the fraction of triples picked that form triangles times the total number
of triples $\binom{n}{3}$, then

$$(1 - \epsilon)T_3 < T_3' < (1 + \epsilon)T_3$$

with probability at least $1 - \delta$. This is not suitable when $T_3 = o(n^2)$, which is often the case when dealing
with real-world networks.
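
The naive sampling estimator can be sketched as follows. This is only an illustration under our own assumptions: vertices are labeled 0 to n-1 (rather than 1 to n), the graph fits in memory, and the number of trials `r` is supplied by the caller instead of being derived from the bound above.

```python
import random
from math import comb


def naive_triple_sampling(n, edges, r, seed=0):
    """Estimate the triangle count by testing r uniformly random vertex triples."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(range(n), 3)       # three distinct vertices
        if (frozenset((a, b)) in edge_set
                and frozenset((b, c)) in edge_set
                and frozenset((a, c)) in edge_set):
            hits += 1
    # fraction of sampled triples that are triangles, scaled by the number of triples
    return hits / r * comb(n, 3)
```

On sparse graphs almost every sampled triple has no edges at all, which is the practical meaning of the dependence on $T_0$ in the number of trials.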

In [4] the authors reduce the problem of triangle counting efficiently to estimating moments for a
stream of node triples. Then, they use the Alon-Matias-Szegedy algorithms [1] (a.k.a. AMS algorithms)
to proceed. The key is that the triangle computation reduces to estimating the zero-th, first and second
frequency moments, which can be done efficiently. Again, as in naive sampling, the denser the graph the
better the approximation. The AMS algorithms are also used by [ ], where simple sampling techniques
are used, such as choosing an edge from the stream at random and checking how many common neighbors
its two endpoints share considering the subsequent edges in the stream. Along the same lines, [7] proposed
two space-bounded sampling algorithms to estimate the number of triangles. Again, the underlying
sampling procedures are simple. E.g., for the case of the edge stream representation, they sample randomly an
edge and a node in the stream and check if they form a triangle. Their algorithms are the state-of-the-art
algorithms to the best of our knowledge. The three-pass algorithm presented therein counts in the first
pass the number of edges, in the second pass it samples uniformly at random an edge $(i, j)$ and a node
$k \in V - \{i, j\}$, and in the third pass it tests whether the edges $(i, k)$, $(k, j)$ are present in the stream. The
number of draws that have to be done in order to get concentration (these draws are done in parallel) is of
the order

$$r = \log\left(\frac{1}{\delta}\right)\frac{1}{\epsilon^2}\left(3 + \frac{T_1 + 2T_2}{T_3}\right)$$

Even if the term $T_0$ is missing compared to naive sampling, the graph still has to be fairly dense
with respect to the number of triangles in order to get an $\epsilon$-approximation with high probability. In the
case of "power-law" networks it was shown in [ ] that the spectral counting of triangles can be efficient
due to their special spectral properties, and [33] extended this idea using the randomized algorithm by
[12] by proposing a simple biased node sampling. This algorithm can be viewed as a special case of a
streaming algorithm, since there exist algorithms, e.g., [ ], that perform a constant number of passes
over the non-zero elements of the matrix to produce a good low rank matrix approximation. In [ ] the
semi-streaming model for counting triangles is introduced, which allows $\log n$ passes over the edges.
The key observation is that since counting triangles reduces to computing the intersection of two sets,
namely the induced neighborhoods of two adjacent nodes, ideas from locality sensitive hashing [6] are
applicable to the problem. In [34] an algorithm is proposed which tosses a coin independently for each edge,
keeping the edge with probability $p$ and throwing it away with probability $q = 1 - p$. It was shown
later by Tsourakakis, Kolountzakis and Miller [35], using a powerful theorem due to Kim and Vu [22], that
under mild conditions on the triangle density the method results in a strongly concentrated estimate of the
number of triangles. More recently, Avron proposed a new approximate triangle counting method based
on a randomized algorithm for trace estimation [3].

2.2 Concentration of Measure 

In Section 3 we make extensive use of the following version of the Chernoff bound [8].

Theorem 1. Let $X_1, X_2, \ldots, X_k$ be independently distributed $\{0, 1\}$ variables with $E[X_i] = p$. Then for
any $\epsilon > 0$, we have

$$\Pr\left[\left|\frac{1}{k}\sum_{i=1}^{k} X_i - p\right| > \epsilon p\right] \le 2e^{-\epsilon^2 p k/2}$$

2.3 Random Projections

A random projection $x \to Rx$ from $\mathbb{R}^d \to \mathbb{R}^k$ approximately preserves all Euclidean distances. One version
of the Johnson-Lindenstrauss lemma [18] is the following:

Lemma 1 (Johnson Lindenstrauss). Suppose $x_1, \ldots, x_n \in \mathbb{R}^d$ and $\epsilon > 0$ and take $k = C\epsilon^{-2}\log n$.
Define the random matrix $R \in \mathbb{R}^{k \times d}$ by taking all $R_{ij} \sim N(0, 1)$ (standard gaussian) and independent.
Then, with probability bounded below by a constant the points $y_j = Rx_j \in \mathbb{R}^k$ satisfy

$$(1 - \epsilon)|x_i - x_j| \le |y_i - y_j| \le (1 + \epsilon)|x_i - x_j|$$

for $i, j = 1, 2, \ldots, n$.

3 Proposed Method

Our algorithm combines two approaches that have been taken on triangle counting: sparsify the graph
by keeping a random subset of the edges [34,35], followed by a triple sampling using the idea of vertex
partitioning due to Alon, Yuster and Zwick [2].

3.1 Edge Sparsification

The following method was introduced in [34] and was shown to perform very well in practice: keep each
edge with probability $p$ independently. Then for each triangle, the probability of it being kept is $p^3$. So the
expected number of triangles left is $p^3 t$. This is an inexpensive way to reduce the size of the graph as it
can be done in one pass over the edge list using $O(mp)$ random variables (more details can be found in
Section 4.2 and [23]).
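
The sparsification estimator can be sketched in a few lines of Python. The sketch below is illustrative only: it draws one random number per edge (not the $O(mp)$ scheme referred to above), counts triangles in the sample with a plain edge-iterator, and uses function names of our own choosing.

```python
import random
from collections import defaultdict


def _count_triangles(edges):
    """Exact count by intersecting the neighborhoods of each edge's endpoints."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    total = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return total // 3


def estimate_triangles_by_sparsification(edges, p, seed=0):
    """Keep each edge independently with probability p (0 < p <= 1), count, rescale."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    # a triangle survives with probability p^3, so divide the sampled count by p^3
    return _count_triangles(kept) / p ** 3
```

With `p = 1.0` every edge is kept and the estimate coincides with the exact count; smaller values of `p` trade accuracy for a smaller graph.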

In a later analysis [35], it was shown that the number of triangles in the sampled graph is concentrated
around the actual triangle count as long as $p^3 \ge \tilde{\Omega}(\Delta/t)$. Here we show a similar bound using more
elementary techniques. Suppose we have a set of $k$ triangles such that no two share an edge. For each such
triangle we define a random variable $X_i$ which is 1 if the triangle is kept by the sampling and 0 otherwise.
Then as the triangles do not have any edges in common, the $X_i$s are independent and take value 0 with
probability $1 - p^3$ and 1 with probability $p^3$. So by the Chernoff bound, the concentration is bounded by:

$$\Pr\left[\left|\frac{1}{k}\sum_{i=1}^{k} X_i - p^3\right| > \epsilon p^3\right] \le 2e^{-\epsilon^2 p^3 k/2}$$

So when $p^3 k \epsilon^2 \ge 4d\log n$, the probability of sparsification returning an $\epsilon$-approximation is at least
$1 - n^{-d}$. This is equivalent to $p^3 k \ge (4d\log n)/\epsilon^2$, so to sample with small $p$ and throw out many edges,
we would like $k$ to be large. To show that such a large set of independent triangles exists, we invoke the
Hajnal-Szemerédi Theorem [16]:

Lemma 2. (Hajnal-Szemerédi Theorem) Every graph with $n$ vertices and maximum vertex degree at most
$k$ is $(k + 1)$-colorable with all color classes of size at least $\lfloor n/(k+1) \rfloor$.

We can apply this theorem by considering the graph where each triangle is a vertex and two vertices
representing triangles $t_1$ and $t_2$ are connected iff they have an edge in common. Each vertex in this graph
has degree at most $O(\Delta)$, and we get:

Corollary 1. Given $t$ triangles where no edge belongs to more than $\Delta$ triangles, we can partition the
triangles into $S_1, \ldots, S_l$ such that no two triangles in the same set share an edge, $|S_i| \ge \Omega(t/\Delta)$ and $l$ is
bounded by $O(\Delta)$.

We can now bound what values of p can give concentration: 

Theorem 2. If $p^3 \in \Omega\left(\frac{d\Delta\log n}{\epsilon^2 t}\right)$, then with probability $1 - n^{3-d}$, the sampled graph has a triangle count
that $\epsilon$-approximates $t$.

Proof. Consider the partition of triangles given by Corollary 1. By choice of $p$ we get that the probability
that the triangle count in each set is preserved within a factor of $\epsilon/2$ is at least $1 - n^{-d}$. Since there are at
most $n^3$ such sets, an application of the union bound gives that their total is approximated within a factor
of $\epsilon/2$ with probability at least $1 - n^{3-d}$. This gives that the triangle count is approximated within a factor
of $\epsilon$ with probability at least $1 - n^{3-d}$.



3.2 Triple Sampling 

Since each triangle corresponds to a triple of vertices, we can construct a set of triples that includes all
triangles, $U$. From this list, we can then sample some triples uniformly; let these samples be numbered
from 1 to $s$. Also, for the $i$-th triple sampled, let $X_i$ be 1 if it is a triangle and 0 otherwise. Since we pick
triples randomly from $U$ and $t$ of them are triangles, we have $E[X_i] = t/|U|$ and the $X_i$s are independent. So
by the Chernoff bound we obtain:

$$\Pr\left[\left|\frac{1}{s}\sum_{i=1}^{s} X_i - \frac{t}{|U|}\right| > \epsilon\frac{t}{|U|}\right] \le 2e^{-\epsilon^2 t s/(2|U|)}$$

So when $s = \Omega(|U|/t \cdot \log n/\epsilon^2)$, we have that $\left(\sum_{i=1}^{s} X_i/s\right)|U|$ approximates $t$ within a factor of $\epsilon$ with
probability at least $1 - n^{-d}$ for any $d$ of our choice. As $|U| \le n^3$, this immediately gives an algorithm with
runtime $O(n^3\log n/(t\epsilon^2))$ that approximates $t$ within a factor of $\epsilon$. Slightly more careful bookkeeping can
also give tighter bounds on $|U|$ in sparse graphs.
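
To make the estimator $\left(\sum_i X_i/s\right)|U|$ concrete, the sketch below takes $U$ to be the set of 'V' shaped triples (a center vertex with two distinct neighbors). This choice of $U$ is an assumption made for illustration; since every triangle appears in this $U$ three times, once per center, the result is divided by 3. Vertex labels are assumed to be sortable.

```python
import random
from collections import defaultdict


def wedge_sampling_estimate(edges, s, seed=0):
    """Estimate the triangle count from s uniformly sampled 'V' shaped triples."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    centers = [u for u in adj if len(adj[u]) >= 2]
    # number of triples centered at u is deg(u) choose 2
    weights = [len(adj[u]) * (len(adj[u]) - 1) // 2 for u in centers]
    size_u = sum(weights)
    if size_u == 0 or s == 0:
        return 0.0
    nbrs = {u: sorted(adj[u]) for u in centers}
    closed = 0
    # picking the center proportionally to its triple count makes the triple uniform in U
    for u in rng.choices(centers, weights=weights, k=s):
        v, w = rng.sample(nbrs[u], 2)
        if w in adj[v]:                 # third edge present: the triple is a triangle
            closed += 1
    return closed / s * size_u / 3
```

Only the membership test `w in adj[v]` touches the graph beyond the sampled center, which is why the number of samples $s$, rather than $|U|$ itself, drives the cost.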

Consider a triple containing vertex $u$, $(u, v, w)$. Since $uv, uw \in E$, the number of such triples
involving $u$ is at most $\deg(u)^2$. Also, as $vw \in E$, another bound on the number of such triples is $m$. When
$\deg(u)^2 > m$, or $\deg(u) > m^{1/2}$, the second bound is tighter, and the first is tighter otherwise.

These two cases naturally suggest that low degree vertices with degree at most $m^{1/2}$ be treated
separately from high degree vertices with degree greater than $m^{1/2}$. For the number of triangles around low
degree vertices, since $x^2$ is convex, the value of $\sum_u \deg(u)^2$ is maximized when all edges are
concentrated in as few vertices as possible. Since the maximum degree of such a vertex is $m^{1/2}$, the number of
such triangles is upper bounded by $m^{1/2} \cdot (m^{1/2})^2 = m^{3/2}$. Also, as the sum of all degrees is $2m$, there can
be at most $2m^{1/2}$ high degree vertices, which means the total number of triangles incident to these high
degree vertices is at most $2m^{1/2} \cdot m = 2m^{3/2}$. Combining these bounds gives that $|U|$ can be upper bounded
by $3m^{3/2}$. Note that this bound is asymptotically tight when $G$ is a complete graph ($n = m^{1/2}$). However,
in practice the second bound can be further reduced by summing over the degree of all $v$ adjacent to $u$,
becoming $\sum_{uv \in E} \deg(v)$. As a result, an algorithm that implicitly constructs $U$ by picking the better one
among these two cases by examining the degrees of all neighbors will achieve

$$|U| \le O(m^{3/2})$$

This better bound on $U$ gives an algorithm that $\epsilon$-approximates the number of triangles in time:

$$O\left(m + \frac{m^{3/2}\log n}{t\epsilon^2}\right)$$

As our experimental data in Section 4.1 indicate, the value of $t$ is usually $\Omega(m)$ in practice. In such
cases, the second term in the above calculation becomes negligible compared to the first one. In fact, in
most of our data, just sampling the first type of triples (i.e., pretending all vertices are of low degree)
brings the second term below the first.

3.3 Hybrid algorithm

Edge sparsification with a probability of $p$ allows us to only work on $O(mp)$ edges, therefore the total
runtime of the triple sampling algorithm after sparsification with probability $p$ becomes:

$$O\left(mp + \frac{(pm)^{3/2}\log n}{p^3 t\epsilon^2}\right) = O\left(mp + \frac{m^{3/2}\log n}{p^{3/2} t\epsilon^2}\right)$$

As stated above, since the first term in most practical cases is much larger, we can set the value of $p$
to balance these two terms out:

$$pm = \frac{m^{3/2}\log n}{p^{3/2} t\epsilon^2}, \qquad p^{5/2} t\epsilon^2 = m^{1/2}\log n, \qquad p = \left(\frac{m^{1/2}\log n}{t\epsilon^2}\right)^{2/5}$$

The actual value of $p$ picked would also depend heavily on the constants in front of both terms, as sampling
is likely much less expensive due to factors such as cache effects and memory efficiency. Nevertheless, our
experimental results in Section 4 do seem to indicate that this type of hybrid algorithm can perform
better in certain situations.
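
The balancing step can be turned into a small calculation. The helper below is a hypothetical illustration, not part of the evaluated code: it ignores the constants hidden in the $O(\cdot)$ terms, assumes natural logarithms, requires an a priori guess of $t$, and caps the result at 1.

```python
import math


def balanced_sparsification_probability(n, m, t, eps):
    """Value of p that equates p*m with m^(3/2) * log(n) / (p^(3/2) * t * eps^2)."""
    # solving the balance equation gives p^(5/2) = m^(1/2) * log(n) / (t * eps^2)
    p = (math.sqrt(m) * math.log(n) / (t * eps ** 2)) ** 0.4
    return min(1.0, p)                  # a probability cannot exceed 1
```

For instance, with n = 10^5, m = 10^6, t = 10^7 and eps = 0.1 this gives a value of p of roughly 0.42, whereas for graphs with few triangles relative to their size the cap at 1 applies and no sparsification is suggested.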



4 Experiments 
4.1 Data 

The graphs used in our experiments are shown in Table 2. Multiple edges and self loops were removed (if 
any). 



Name               Nodes      Edges        Triangle Count  Description
AS-Skitter         1,696,415  11,095,298   28,769,868      Autonomous Systems
Flickr             1,861,232  15,555,040   548,658,705     Person to Person
Livejournal-links  5,284,457  48,709,772   310,876,909     Person to Person
Orkut-links        3,072,626  116,586,585  285,730,264     Person to Person
Soc-LiveJournal    4,847,571  42,851,237   285,730,264     Person to Person
Web-EDU            9,845,725  46,236,104   254,718,147     Web Graph (page to page)
Web-Google         875,713    3,852,985    11,385,529      Web Graph
Wikipedia 2005/11  1,634,989  18,540,589   44,667,095      Web Graph (page to page)
Wikipedia 2006/9   2,983,494  35,048,115   84,018,183      Web Graph (page to page)
Wikipedia 2006/11  3,148,440  37,043,456   88,823,817      Web Graph (page to page)
Wikipedia 2007/2   3,566,907  42,375,911   102,434,918     Web Graph (page to page)
Youtube[ ]         1,157,822  2,990,442    4,945,382       Person to Person

Table 2. Datasets used in our experiments.



4.2 Experimental Setup and Implementation Details 

The experiments were performed on a single machine with an Intel Xeon CPU at 2.83 GHz, 6144KB cache
size and 50GB of main memory. The graphs are real world graphs; some details regarding them are given in
Table 2. The algorithm was implemented in C++ and compiled using gcc version
4.1.2 with the -O3 optimization flag. Time was measured by taking the user time given by the Linux time
command. IO times are included in that time since the amount of memory operations performed in setting
up the graph is non-trivial. However, we use a modified IO routine that is much faster than the standard
C/C++ scanf.

A major optimization that we used was to sort the edges in the graph and store the input file as a
sequence of neighbor lists, one per vertex. Each neighbor list begins with the size of the list,
followed by the neighbors. This is similar to how software such as Matlab stores sparse matrices, and the
preprocessing time to change the data into this format is not counted. It can significantly improve the cache
behavior of the stored graph, and therefore the performance.

Some implementation details are based on this graph storage format. Since each triple that we
check already has 2 edges in the graph, it suffices to check whether the 3rd edge is in the graph.
This can be done offline by comparing a smaller list of edges against the initial edge list of the graph and
counting the number of entries that they have in common. Once we sort the query list, the entire process can
be done offline in one pass through the graph. This also means that instead of picking a pre-determined
sample rate for the triples, we can vary the sample rate so that the number of queries is about the
same as the size of the graph. Finally, in the next paragraph we discuss the details behind efficient binomial
sampling. Specifically, picking a random subset of expected size $p|S|$ from a set $S$ can be done in expected
sublinear time [23].
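
The offline check of the third edges can be sketched as a merge between the sorted edge list and the sorted query list. The code below is an illustration under our own assumptions: both lists hold edges as `(u, v)` tuples normalized so that `u < v`, the graph's edge list is already sorted, and the function name is hypothetical.

```python
def count_present(sorted_edges, queries):
    """Count how many queried edges occur in the sorted edge list, in a single pass."""
    queries = sorted(queries)
    hits = 0
    j = 0
    for e in sorted_edges:
        # skip queries that sort before the current edge: they are not in the graph
        while j < len(queries) and queries[j] < e:
            j += 1
        # every copy of a query equal to the current edge is a closed triple
        while j < len(queries) and queries[j] == e:
            hits += 1
            j += 1
    return hits
```

Both lists are traversed sequentially, so the pass over the graph stays cache friendly regardless of how the queries were generated.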

Binomial Sampling in Expected Sublinear Time. Most of our algorithms have the following routine at
their core: given a list of values, keep each of them with probability $p$ and discard it with probability $1 - p$.
If the list has length $n$, this can clearly be done using $n$ random variables. As generating random variables
can be expensive, it is preferable to use $O(np)$ random variables in expectation if possible. One possibility
is to pick $O(np)$ random elements, but this would likely involve random accesses in the list, or maintaining
a list of the indices picked in sorted order. A simple way that we use in our code to perform this sampling
is to generate the differences between indices of entries retained [ ]. This variable follows a geometric
distribution (the discrete analogue of the exponential distribution), and if $x$ is a uniform random number in
$(0, 1)$, it can be generated by taking $\lceil \log_{1-p} x \rceil$. The primary
advantage of doing so is that sampling can be done while accessing the data in a sequential fashion, which
results in much better cache performance.
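
A minimal sketch of this gap-based sampling is given below. It is written in Python for readability rather than speed and is not the routine used in our C++ code; the function name `skip_sample` and the handling of the boundary cases `p <= 0` and `p >= 1` are our own additions.

```python
import math
import random


def skip_sample(items, p, seed=0):
    """Keep each element of a sequence with probability p by drawing index gaps."""
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(items)
    rng = random.Random(seed)
    log_q = math.log(1.0 - p)           # negative, since 0 < p < 1
    kept = []
    i = -1
    while True:
        x = 1.0 - rng.random()          # uniform in (0, 1], avoids log(0)
        # ceil(log_{1-p} x) is geometric on {1, 2, ...}; guard the x == 1 corner case
        gap = max(1, math.ceil(math.log(x) / log_q))
        i += gap
        if i >= len(items):
            return kept
        kept.append(items[i])
```

The loop draws one random number per retained element plus one final draw, so the expected number of random variables is about $np + 1$, and the indices are visited in increasing order.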

4.3 Results 

The six variants of the code involved in the experiment are first separated by whether the graph was first
sparsified by keeping each edge with probability $p = 0.1$. In either case, an exact algorithm based on hybrid
sampling with performance bounded by $O(m^{3/2})$ is run. Then two triple based sampling algorithms are
also considered. They differ in whether an attempt is made to distinguish between low and high degree vertices, so
the simple version is essentially sampling all 'V' shaped triples off each vertex. Note that the variant with no
sparsification and exact counting also generates the exact number of triangles. Errors are measured by the absolute
value of the difference between the value produced and the exact number of triangles divided by the exact number. The
results on error and running time are averages over five runs. Results on the graphs described above are
listed in Table 3, with the methods listed in the columns.





                   No Sparsification                                     Sparsified (p = 0.1)
                   Exact             Simple            Hybrid            Exact             Simple            Hybrid
Graph              err(%)   time     err(%)   time     err(%)   time     err(%)   time     err(%)   time     err(%)   time
AS-Skitter         0.000    4.452    1.308    0.746    0.128    1.204    2.188    0.641    3.208    0.651    1.388    0.877
Flickr             0.000    41.981   0.166    1.049    0.128    2.016    0.530    1.389    0.746    0.860    0.818    1.033
Livejournal-links  0.000    50.828   0.309    2.998    0.116    9.375    0.242    3.900    0.628    2.518    1.011    3.475
Orkut-links        0.000    202.012  0.564    6.208    0.286    21.328   0.172    9.881    1.980    5.322    0.761    7.227
Soc-LiveJournal    0.000    38.271   0.285    2.619    0.108    7.451    0.681    3.493    0.830    2.222    0.462    2.962
Web-EDU            0.000    8.502    0.157    2.631    0.047    3.300    0.571    2.864    0.771    2.354    0.383    2.732
Web-Google         0.000    1.599    0.286    0.379    0.045    0.740    1.112    0.251    1.262    0.371    0.264    0.265
Wiki-2005          0.000    32.472   0.976    1.197    0.318    3.613    1.249    1.529    7.498    1.025    0.695    1.313
Wiki-2006/9        0.000    86.623   0.886    2.250    0.361    7.483    0.402    3.431    6.209    1.843    2.091    2.598
Wiki-2006/11       0.000    96.114   1.915    2.362    0.530    7.972    0.634    3.578    4.050    1.947    0.950    2.778
Wiki-2007          0.000    122.395  0.943    2.728    0.178    9.268    0.819    4.407    3.099    2.224    1.448    3.196
Youtube            0.000    1.347    1.114    0.333    0.127    0.500    1.358    0.210    5.511    0.302    1.836    0.268

Table 3. Results of Experiments Averaged Over 5 Trials

4.4 Remarks



From Table 3 it is clear that none of the variants clearly outperforms the others on all the data. The gain/loss
from sparsification is likely due to the fixed sampling rate, so varying it as in earlier works [34] is likely
to mitigate this discrepancy. The difference between simple and hybrid sampling is due to the fact that
handling the second case of triples has a much worse cache access pattern, as it examines vertices that are
two hops away. There are alternative ways to handle this situation, which would be
interesting for future implementations. A fixed sparsification rate of $p = 10\%$ was used mostly to simplify
the setup of the experiments. In practice varying $p$ to look for a rate where the result stabilizes is the
preferred option [35].

When compared with previous results on this problem, the error rates and running times of our results
are all significantly lower. In fact, on the wiki graphs our exact counting algorithms have about the same
order of speed as other approximate triangle counting implementations.



5 Theoretical Ramifications 
5.1 Random Projections and Triangles 

Consider any two vertices $i, j \in V$ which are connected, i.e., $(i, j) \in E$. Observe that the inner product of
the $i$-th and $j$-th column of the adjacency matrix of graph $G$ gives the number of triangles that edge $(i, j)$
participates in. Viewing the adjacency matrix as a collection of $n$ points in $\mathbb{R}^n$, a natural question to ask is
whether we can use results from the theory of random projections [18] to reduce the dimensionality of the
points while preserving the inner products which contribute to the count of triangles. Magen and Zouzias
[25] have considered a similar problem, namely random projections which preserve approximately the
volume for all subsets of at most $k$ points.

According to Lemma 1, a random projection $x \to Rx$ from $\mathbb{R}^d \to \mathbb{R}^k$ approximately preserves all
Euclidean distances. However it does not preserve all pairwise inner products. This can easily be seen by
considering the set of points

$$e_1, \ldots, e_n \in \mathbb{R}^n = \mathbb{R}^d,$$

where $e_1 = (1, 0, \ldots, 0)$ etc. Indeed, all inner products between distinct points of the above set are zero, which
cannot happen for the points $Re_j$ as they belong to a lower dimensional space and they cannot all be orthogonal. For the
triangle counting problem we do not need to approximate all inner products. Suppose $A \in \{0, 1\}^{n \times n}$ is the
adjacency matrix of a simple undirected graph $G$ with vertex set $V(G) = \{1, 2, \ldots, n\}$ and write $A_i$ for
the $i$-th column of $A$. The quantity we are interested in is the number of triangles in $G$ (actually six times
the number of triangles) $t = \sum_{u,v,w \in V(G)} A_{uv}A_{vw}A_{wu}$.

If we apply a random projection of the above kind to the columns of $A$, $A_i \to RA_i$, and write $X =
\sum_{u,v,w \in V(G)} (RA)_{uv}(RA)_{vw}(RA)_{wu}$, it is easy to see that $E[X] = 0$ since $X$ is a linear combination of
triple products $R_{ij}R_{kl}R_{rs}$ of entries of the random matrix $R$ and all such products have expected
value 0, no matter what the indices. So we cannot expect this kind of random projection to work.

Therefore we consider the following approach, which still has limitations as we will show.
Let $t = \sum_{u \sim v} A_u^T A_v$, where $u \sim v$ means $A_{uv} = 1$, and look at the quantity

$$Y = \sum_{u \sim v} (RA_u)^T (RA_v) = \sum_{l=1}^{k} \sum_{i,j=1}^{n} \left(\sum_{u \sim v} A_{iu}A_{jv}\right) R_{li}R_{lj} = \sum_{l=1}^{k} \sum_{i,j=1}^{n} \#\{i - * - * - j\}\, R_{li}R_{lj},$$

where $\#\{i - * - * - j\}$ denotes the number of walks of length 3 from $i$ to $j$.
This is a quadratic form in the gaussian $N(0, 1)$ variables $R_{ij}$. By simple calculation for the mean value
and diagonalization for the variance we see that if the $X_j$ are independent $N(0, 1)$ variables and

$$Z = X^T B X,$$

where $X = (X_1, \ldots, X_n)^T$ and $B \in \mathbb{R}^{n \times n}$ is symmetric, then

$$E[Z] = \mathrm{Tr}\, B, \qquad \mathrm{Var}[Z] = 2\,\mathrm{Tr}\, B^2 = 2\sum_{i,j=1}^{n} (B_{ij})^2.$$

Hence $E[Y] = \sum_{l=1}^{k} \sum_{i=1}^{n} \#\{i - * - * - i\} = k \cdot t$, so the mean value is the quantity we want
(multiplied by $k$). For this to be useful we should have some concentration for $Y$ near $E[Y]$. We do not need
exponential tails because we have only one quantity to control. In particular, a statement of the following
type

$$\Pr[|Y - E[Y]| > \epsilon E[Y]] < 1 - c_\epsilon,$$

where $c_\epsilon > 0$, would be enough. The simplest way to check this is by computing the standard deviation of
$Y$. By Chebyshev's inequality it suffices that the standard deviation be much smaller than $E[Y]$. According
to the formula above for the variance of a quadratic form we get

$$\mathrm{Var}[Y] = 2\sum_{l=1}^{k} \sum_{i,j=1}^{n} \left(\#\{i - * - * - j\}\right)^2 = C \cdot k \cdot \#\{x - * - * - * - * - * - x\} = C \cdot k \cdot (\text{number of circuits of length 6 in } G).$$

Therefore, to have concentration it is sufficient that

$$\mathrm{Var}[Y] = o(k \cdot (E[Y])^2). \qquad (1)$$

Observe that (1) is a sufficient, and not necessary, condition. Furthermore, (1) is certainly not always
true as there are graphs with many 6-circuits and no triangles at all (the circuits may repeat vertices or
edges).
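
The estimator $Y/k$ can be tried out on small graphs with the following pure Python sketch. It is meant only to illustrate the quantities defined above: the adjacency matrix is a dense list of lists, the function names are ours, and no attempt is made at efficiency.

```python
import random


def closed_3_walks(adj_matrix):
    """t = sum over ordered adjacent pairs (u, v) of <A_u, A_v> (6 x triangle count)."""
    n = len(adj_matrix)
    return sum(
        sum(adj_matrix[i][u] * adj_matrix[i][v] for i in range(n))
        for u in range(n) for v in range(n) if adj_matrix[u][v]
    )


def projected_estimate(adj_matrix, k, seed=0):
    """Return Y / k, where Y sums <R A_u, R A_v> over adjacent pairs, R being k x n gaussian."""
    rng = random.Random(seed)
    n = len(adj_matrix)
    R = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    # RA[l][u] is coordinate l of the projected column R A_u
    RA = [[sum(R[l][i] * adj_matrix[i][u] for i in range(n)) for u in range(n)]
          for l in range(k)]
    y = 0.0
    for u in range(n):
        for v in range(n):
            if adj_matrix[u][v]:
                y += sum(RA[l][u] * RA[l][v] for l in range(k))
    return y / k
```

On the complete graph with four vertices, `closed_3_walks` returns 24 (six times its four triangles), and `projected_estimate` fluctuates around this value with a spread that shrinks as `k` grows.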
5.2 Sampling in the Semi-Streaming Model

The previous analysis of triangle counting by Alon, Yuster and Zwick was done in the streaming model
[2], where the assumption was constant space overhead. We show that our sampling algorithm can be done
in a slightly weaker model with space usage equaling:

$$O\left(m^{1/2}\log n + \frac{m^{3/2}\log n}{t\epsilon^2}\right)$$

We assume the edges adjacent to each vertex are given in order [15]. We first need to identify high
degree vertices, specifically the ones with degree higher than $m^{1/2}$. This can be done by sampling $O(m^{1/2}\log n)$
edges and recording the vertices that are endpoints of one of those edges.

Lemma 3. Suppose $dm^{1/2}\log n$ samples were taken, then the probability of all vertices with degree at
least $m^{1/2}$ being chosen is at least $1 - n^{-d+1}$.

Proof. Consider some vertex $v$ with degree at least $m^{1/2}$. The probability of it being picked in each
iteration is at least $m^{1/2}/m = m^{-1/2}$. As a result, the probability of it not being picked in $dm^{1/2}\log n$
iterations is:

$$(1 - m^{-1/2})^{dm^{1/2}\log n} = \left[(1 - m^{-1/2})^{m^{1/2}}\right]^{d\log n} \le \left(\frac{1}{e}\right)^{d\log n} = n^{-d}$$

As there are at most $n$ vertices, applying the union bound gives that all vertices with degree at least $m^{1/2}$ are
sampled with probability at least $1 - n^{-d+1}$. □

This requires one pass over the graph. Note that the number of such candidates for high degree vertices
can be reduced to $m^{1/2}$ using another pass over the edge list.
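
The candidate selection of Lemma 3 can be sketched as follows. The sketch makes several simplifying assumptions that are ours: the edge list is held in memory and indexed at random (a true streaming implementation would instead fix the sampled positions before the pass), natural logarithms are used, and `d` and the function name are illustrative.

```python
import math
import random


def high_degree_candidates(edges, n, d=2, seed=0):
    """Endpoints of about d * sqrt(m) * log(n) edges sampled uniformly with replacement."""
    rng = random.Random(seed)
    m = len(edges)
    if m == 0:
        return set()
    samples = math.ceil(d * math.sqrt(m) * math.log(n))
    candidates = set()
    for _ in range(samples):
        u, v = edges[rng.randrange(m)]
        candidates.add(u)               # both endpoints become candidates
        candidates.add(v)
    return candidates
```

The returned set may also contain low degree vertices; as noted above, a further pass that counts the degrees of the candidates can discard them.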

For all the low degree vertices, we can read their $O(m^{1/2})$ neighbors and sample them. For the high
degree vertices, we do the following: for each edge, obtain a random variable $y$ from a binomial distribution
equal to the number of edge/vertex pairs that this edge is involved in. Then pick $y$ vertices from the list
of high degree vertices randomly. These two sampling procedures can be done together in another pass
over the data.

Finally, we need to check whether each edge in the sampled triples belongs to the edge list. We can
store all such queries in a hash table as there are at most $O\left(\frac{m^{3/2}\log n}{t\epsilon^2}\right)$ edges sampled w.h.p. Then going
through the graph edges in a single pass and looking them up in the table yields the desired answer.



6 Conclusions & Future Work 

In this work, we extended previous work [34,33] by introducing the powerful idea of Alon, Yuster and
Zwick [2]. Specifically, we propose a Monte Carlo algorithm which approximates the true number of
triangles within $\epsilon$ and runs in $O\left(m + \frac{m^{3/2}\Delta\log n}{t\epsilon^2}\right)$ time. Our method can be extended to the semi-streaming
model using three passes and a memory overhead of $O\left(m^{1/2}\log n + \frac{m^{3/2}\Delta\log n}{t\epsilon^2}\right)$.

In practice our methods obtain excellent running times, typically a few seconds for graphs with several
millions of edges. The accuracy is also satisfactory, especially for the type of applications we are concerned
with. Finally, we propose a random projection based method for triangle counting and provide a sufficient
condition to obtain an estimate with low variance. A natural question is the following: can we provide
some reasonable condition on $G$ that would guarantee (1)? Finally, since our proposed methods are easily
parallelizable, developing such an implementation in the MapReduce framework, see [1 ] and [21,20],
is a natural practical direction.